Introduction :

In this project report, we analyse a housing-price dataset containing house sale records for King County, Washington (an area that includes Seattle). The dataset contains unknown data trends, and several attributes are suspected to influence house prices linearly. In this project, we try to identify these predictor variables and use them to estimate house prices.

Importing all necessary libraries

#install.packages("caret")
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(corrplot)
## corrplot 0.92 loaded
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caTools)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

Importing dataset

The dataset consists of house prices from King County, an area in the US state of Washington that includes Seattle. The dataset was obtained from Kaggle. Predicting property prices is becoming increasingly important and beneficial. We are wrangling a large set of property sales records whose format and data quality are not known in advance.

Columns : date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15

data <- read.csv(file = 'house_prediction_data.csv')
head(data)
print(paste("Number of records: ", nrow(data)))
## [1] "Number of records:  21597"
print(paste("Number of features: ", ncol(data)))
## [1] "Number of features:  21"
summary(data)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21597       Min.   :  78000   Min.   : 1.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 322000   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540297   Mean   : 3.373  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.500   Min.   :  370   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.116   Mean   : 2080   Mean   :  15099   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10685   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition        grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.00   Min.   : 3.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.00   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.00   Median : 7.000  
##  Mean   :0.007547   Mean   :0.2343   Mean   :3.41   Mean   : 7.658  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.00   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.00   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated    
##  Min.   : 370   Min.   :   0.0   Min.   :1900   Min.   :   0.00  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.00  
##  Mean   :1789   Mean   : 291.7   Mean   :1971   Mean   :  84.46  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.00  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12758  
##  3rd Qu.: 10083  
##  Max.   :871200

Modifying the data

When we look at house prices, we believe that price depends strongly on the age of the house and on whether it has been renovated. Hence, we add these two pieces of information to our dataset.

saleDate = mdy( data$date )
data$saleDateYr = as.integer(year(saleDate))
data$age = data$saleDateYr - data$yr_built

data$reno = ifelse( data$yr_renovated==0,0,1 )
data$reno = as.numeric( data$reno )
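As a sanity check, the derivation above can be sketched on a couple of hypothetical rows, using base R's as.Date() in place of lubridate::mdy():

```r
# Toy illustration (hypothetical rows) of the derived columns above.
toy <- data.frame(
  date         = c("10/13/2014", "12/9/2014"),
  yr_built     = c(1955, 2001),
  yr_renovated = c(1991, 0)
)
toy$saleDateYr <- as.integer(format(as.Date(toy$date, format = "%m/%d/%Y"), "%Y"))
toy$age        <- toy$saleDateYr - toy$yr_built        # house age at sale
toy$reno       <- ifelse(toy$yr_renovated == 0, 0, 1)  # 1 = ever renovated
toy
```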

Feature Selection

We want to select the columns that we believe should be the independent variables for our model. Thus, we create a separate dataframe containing only these columns.

maindf <- data[, c( "bedrooms", "bathrooms", "sqft_living", "view", "grade", "age", "waterfront", "long", "lat", "zipcode",
                   "condition", "sqft_above", "sqft_living15", "reno", "price", "sqft_lot", "floors" ) ]

head(maindf,10)

Checking Null values

Let’s check whether we have any null values that need to be weeded out as part of our data cleaning.

sum(is.na(maindf))
## [1] 0
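Beyond the overall count, a per-column check can pinpoint where missing values live; sketched here on a toy frame with one deliberately missing value:

```r
# Toy frame (hypothetical values) with one planted NA.
toy <- data.frame(price = c(221900, NA, 180000), bedrooms = c(3, 2, 2))
colSums(is.na(toy))        # NA count per column
toy_clean <- na.omit(toy)  # drop rows containing any NA
nrow(toy_clean)
```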

Dividing into train and test data

set.seed(123)   #  set seed to ensure you always have same random numbers generated

trainIndex <- createDataPartition(maindf$price, p=0.8, list=FALSE)
train_data <- maindf[ trainIndex,]
test_data <- maindf[-trainIndex,]
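For illustration, here is a base-R sketch of the same 80/20 split; note that caret::createDataPartition additionally stratifies on the outcome, which plain random sampling does not.

```r
# Base-R 80/20 split on a stand-in row count (n is hypothetical here).
set.seed(123)
n   <- 1000
idx <- sample(seq_len(n), size = floor(0.8 * n))
train_idx <- idx
test_idx  <- setdiff(seq_len(n), idx)
c(train = length(train_idx), test = length(test_idx))
```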

Correlation matrix

round(cor(train_data), digits = 2 )
##               bedrooms bathrooms sqft_living  view grade   age waterfront  long
## bedrooms          1.00      0.53        0.59  0.08  0.36 -0.16       0.00  0.13
## bathrooms         0.53      1.00        0.76  0.19  0.66 -0.50       0.06  0.22
## sqft_living       0.59      0.76        1.00  0.28  0.76 -0.31       0.10  0.24
## view              0.08      0.19        0.28  1.00  0.25  0.06       0.41 -0.08
## grade             0.36      0.66        0.76  0.25  1.00 -0.45       0.08  0.19
## age              -0.16     -0.50       -0.31  0.06 -0.45  1.00       0.03 -0.41
## waterfront        0.00      0.06        0.10  0.41  0.08  0.03       1.00 -0.04
## long              0.13      0.22        0.24 -0.08  0.19 -0.41      -0.04  1.00
## lat              -0.01      0.02        0.05  0.00  0.11  0.15      -0.01 -0.13
## zipcode          -0.16     -0.20       -0.20  0.08 -0.19  0.35       0.03 -0.56
## condition         0.02     -0.13       -0.06  0.04 -0.14  0.36       0.01 -0.10
## sqft_above        0.49      0.69        0.88  0.17  0.76 -0.42       0.07  0.34
## sqft_living15     0.40      0.57        0.76  0.28  0.71 -0.32       0.08  0.33
## reno              0.01      0.05        0.05  0.11  0.01  0.23       0.10 -0.07
## price             0.32      0.53        0.70  0.40  0.67 -0.05       0.27  0.02
## sqft_lot          0.04      0.08        0.16  0.07  0.10 -0.04       0.02  0.22
## floors            0.18      0.50        0.36  0.04  0.46 -0.49       0.02  0.12
##                 lat zipcode condition sqft_above sqft_living15  reno price
## bedrooms      -0.01   -0.16      0.02       0.49          0.40  0.01  0.32
## bathrooms      0.02   -0.20     -0.13       0.69          0.57  0.05  0.53
## sqft_living    0.05   -0.20     -0.06       0.88          0.76  0.05  0.70
## view           0.00    0.08      0.04       0.17          0.28  0.11  0.40
## grade          0.11   -0.19     -0.14       0.76          0.71  0.01  0.67
## age            0.15    0.35      0.36      -0.42         -0.32  0.23 -0.05
## waterfront    -0.01    0.03      0.01       0.07          0.08  0.10  0.27
## long          -0.13   -0.56     -0.10       0.34          0.33 -0.07  0.02
## lat            1.00    0.27     -0.02       0.00          0.05  0.03  0.31
## zipcode        0.27    1.00      0.00      -0.26         -0.28  0.07 -0.05
## condition     -0.02    0.00      1.00      -0.16         -0.09 -0.06  0.04
## sqft_above     0.00   -0.26     -0.16       1.00          0.73  0.02  0.60
## sqft_living15  0.05   -0.28     -0.09       0.73          1.00 -0.01  0.59
## reno           0.03    0.07     -0.06       0.02         -0.01  1.00  0.12
## price          0.31   -0.05      0.04       0.60          0.59  0.12  1.00
## sqft_lot      -0.09   -0.13      0.00       0.17          0.14  0.01  0.08
## floors         0.05   -0.05     -0.26       0.53          0.28  0.01  0.26
##               sqft_lot floors
## bedrooms          0.04   0.18
## bathrooms         0.08   0.50
## sqft_living       0.16   0.36
## view              0.07   0.04
## grade             0.10   0.46
## age              -0.04  -0.49
## waterfront        0.02   0.02
## long              0.22   0.12
## lat              -0.09   0.05
## zipcode          -0.13  -0.05
## condition         0.00  -0.26
## sqft_above        0.17   0.53
## sqft_living15     0.14   0.28
## reno              0.01   0.01
## price             0.08   0.26
## sqft_lot          1.00  -0.01
## floors           -0.01   1.00
#install.packages("ggcorrplot")
library(ggcorrplot)
corr <- round(cor(train_data), 1)

# Plot
ggcorrplot(corr,
           type = "lower",
           lab = TRUE, 
           lab_size = 5,  
           colors = c("tomato2", "white", "springgreen3"),
           title="Correlogram of Housing Dataset", 
           ggtheme=theme_bw)

According to our corrplot, price is positively correlated with the bedrooms, bathrooms, sqft_living, view, grade, waterfront, sqft_above, lat, sqft_living15 and reno columns.

Scatterplot

Let’s draw a few scatter plots of the selected columns to view their distributions and their relationships with price.

p1=ggplot(data = train_data, aes(x = bedrooms, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Bedrooms and Price", x="bedrooms", y="Price")

p2=ggplot(data = train_data, aes(x = bathrooms, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Bathrooms and Price", x="bathrooms", y="Price")

p3=ggplot(data = train_data, aes(x = sqft_living, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Sqft_living and Price", x="sqft_living", y="Price")

p4=ggplot(data = train_data, aes(x = age, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Age and Price", x="age", y="Price")

p5=ggplot(data = train_data, aes(x = floors, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Floors and Price", x="floors", y="Price")

p6=ggplot(data = train_data, aes(x = reno, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Renovation and Price", x="reno", y="Price")

p7=ggplot(data = train_data, aes(x = view, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of View and Price", x="view", y="Price")

p8=ggplot(data = train_data, aes(x = grade, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Grade and Price", x="grade", y="Price")

grid.arrange( p1, p2, p3, p4, p5, p6, p7, p8, nrow = 4 )

Density plot

We use density plots on a few example columns to view the distribution and type of the data.

library(e1071)

par(mfrow=c(2, 3)) 

plot(density(train_data$bedrooms), main="Density Plot: Bedrooms", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$bedrooms), 2)))  
polygon(density(train_data$bedrooms), col="green")

plot(density(train_data$sqft_living), main="Density Plot: sqft_living", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$sqft_living), 2)))  
polygon(density(train_data$sqft_living), col="orange")

plot(density(train_data$age), main="Density Plot: age", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$age), 2)))  
polygon(density(train_data$age), col="green")

plot(density(train_data$reno), main="Density Plot: reno", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$reno), 2)))  
polygon(density(train_data$reno), col="orange")

plot(density(train_data$floors), main="Density Plot: floors", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$floors), 2)))  
polygon(density(train_data$floors), col="green")

plot(density(train_data$view), main="Density Plot: view", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(train_data$view), 2)))  
polygon(density(train_data$view), col="orange")

Boxplot for checking outliers

We use boxplots to view possible outliers in a few example columns.

par(mfrow=c(2, 3))  # divide graph area in 2 columns
boxplot(train_data$bedrooms, main="Bedrooms")
boxplot(train_data$sqft_living, main="sqft_living")
boxplot(train_data$grade, main="grade")
boxplot(train_data$bathrooms, main="bathrooms")
boxplot(train_data$view, main="view")
boxplot(train_data$age, main="age")

Removing outliers :

The boxplots above show a significantly large number of outliers, and outliers can strongly affect our predictive models. To understand their impact, we plot scatterplots with and without outliers.

Using the price column, we remove outliers from the dataset to obtain more accurate results.

outliers = boxplot(train_data$price,plot=FALSE)$out
outliers_data =train_data[which(train_data$price %in% outliers),]
train_data1 = train_data[-which(train_data$price %in% outliers),]
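The outliers returned by boxplot(..., plot = FALSE)$out follow Tukey’s 1.5 × IQR rule; a sketch on a toy price vector with one extreme value (boxplot uses Tukey hinges, which can differ slightly from quantile(), though the two agree on this example):

```r
# Manually reproduce the 1.5 * IQR rule that boxplot's $out applies.
x   <- c(250000, 320000, 450000, 540000, 645000, 7700000)
qs  <- quantile(x, c(0.25, 0.75))
iqr <- qs[[2]] - qs[[1]]
lo  <- qs[[1]] - 1.5 * iqr
hi  <- qs[[2]] + 1.5 * iqr
manual <- x[x < lo | x > hi]         # values flagged by the IQR rule
identical(sort(manual), sort(boxplot(x, plot = FALSE)$out))
```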

Visualization of scatter plots with and without outliers for some example columns :

par(mfrow=c(1, 2))
plot(train_data$bedrooms, train_data$price, main="With Outliers", xlab="bedrooms", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ bedrooms, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$bedrooms, train_data1$price, main="Outliers removed", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bedrooms, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$sqft_living, train_data$price, main="With Outliers", xlab="sqft_living", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_living, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$sqft_living, train_data1$price, main="Outliers removed", xlab="sqft_living", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_living, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$view, train_data$price, main="With Outliers", xlab="view", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ view, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$view, train_data1$price, main="Outliers removed", xlab="view", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ view, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$grade, train_data$price, main="With Outliers", xlab="grade", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ grade, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$grade, train_data1$price, main="Outliers removed", xlab="grade", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ grade, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$bathrooms, train_data$price, main="With Outliers", xlab="bathrooms", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ bathrooms, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$bathrooms, train_data1$price, main="Outliers removed", xlab="bathrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bathrooms, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$age, train_data$price, main="With Outliers", xlab="age", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ age, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$age, train_data1$price, main="Outliers removed", xlab="age", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ age, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$waterfront, train_data$price, main="With Outliers", xlab="waterfront", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ waterfront, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$waterfront, train_data1$price, main="Outliers removed", xlab="waterfront", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ waterfront, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$sqft_above, train_data$price, main="With Outliers", xlab="sqft_above", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_above, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$sqft_above, train_data1$price, main="Outliers removed", xlab="sqft_above", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_above, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$lat, train_data$price, main="With Outliers", xlab="lat", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ lat, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$lat, train_data1$price, main="Outliers removed", xlab="lat", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ lat, data=train_data1), col="blue", lwd=3, lty=2)

par(mfrow=c(1, 2))
plot(train_data$sqft_living15, train_data$price, main="With Outliers", xlab="sqft_living15", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_living15, data=train_data), col="blue", lwd=3, lty=2)

# Plot of original data without outliers. Note the change of slope.

plot(train_data1$sqft_living15, train_data1$price, main="Outliers removed", xlab="sqft_living15", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_living15, data=train_data1), col="blue", lwd=3, lty=2)

From these scatter plots, we conclude that the relationships between price and bedrooms, bathrooms, sqft_living, sqft_above, lat and sqft_living15 are linear. We can also see that the slope of the best-fit line changes after the outliers are removed: the outliers pull the regression line towards their large errors, so removing them adjusts the fit.

Multivariate linear regression

From the corrplot, we picked the independent variables that have relatively high positive correlation with price. We fit a linear regression model and evaluate its accuracy on the training and test data.

linearmodel = lm(price ~ bedrooms+bathrooms+sqft_living+view+grade+sqft_lot+age+floors+waterfront, 
                 data = train_data1)
summary(linearmodel)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view + 
##     grade + sqft_lot + age + floors + waterfront, data = train_data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -700039  -93309   -6537   81522  792471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.453e+05  1.191e+04 -54.184  < 2e-16 ***
## bedrooms    -1.450e+04  1.573e+03  -9.219  < 2e-16 ***
## bathrooms    3.083e+04  2.538e+03  12.149  < 2e-16 ***
## sqft_living  8.563e+01  2.605e+00  32.869  < 2e-16 ***
## view         2.348e+04  1.821e+03  12.895  < 2e-16 ***
## grade        1.006e+05  1.626e+03  61.913  < 2e-16 ***
## sqft_lot    -6.246e-03  2.701e-02  -0.231 0.817156    
## age          2.772e+03  4.838e+01  57.295  < 2e-16 ***
## floors       3.769e+04  2.506e+03  15.042  < 2e-16 ***
## waterfront   6.935e+04  2.101e+04   3.301 0.000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137200 on 16336 degrees of freedom
## Multiple R-squared:  0.5666, Adjusted R-squared:  0.5664 
## F-statistic:  2373 on 9 and 16336 DF,  p-value: < 2.2e-16

Using the above columns, we see that the relationship between price and the chosen independent variables is only moderately strong, as the R-squared value (0.57) is not very high. Also, the coefficient of sqft_lot is not statistically significant (p = 0.82), so we can drop it.
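Dropping a non-significant term can be done conveniently with update(); sketched here on a toy fit (hypothetical columns a and b) rather than refitting the report’s own model:

```r
# update() refits a model with a modified formula; "- b" plays the role
# of dropping sqft_lot from the report's model.
toy <- data.frame(y = c(2, 4, 6, 8, 11, 12),
                  a = 1:6,
                  b = c(5, 3, 6, 2, 7, 1))
full    <- lm(y ~ a + b, data = toy)
reduced <- update(full, . ~ . - b)
names(coef(reduced))
```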

Detecting influential outliers

Declaring an observation an outlier based on just one (non-pivotal) feature can lead to unrealistic inferences. Rather than judging whether a single variable is extreme in isolation, it is better to consider all the features collectively. Influential outliers affect the model’s predictions and can produce unjustified results. We use Cook’s distance to find such outliers.

cooksd <- cooks.distance(linearmodel)

Now we plot the Cook’s distance values.

par(mfrow=c(1, 1))
plot(cooksd, main="Influential Obs by Cooks distance",xlim=c(0,5000),ylim=c(0,0.1))
axis(1, at=seq(0, 5000, 2000))
axis(2, at=seq(0, 0.1, 0.01))
abline(h = 4*mean(cooksd, na.rm=T), col="green")  
text(x=1:length(cooksd)+1,y=cooksd,labels=ifelse(cooksd>4*mean(cooksd,na.rm=T),names(cooksd),""), col="red") 

We identify the influential points in the data and extract the influential outliers.

influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
head(train_data[influential, ],10)
influential_data = train_data[influential, ]

influencial_outliers=inner_join(outliers_data,influential_data)
## Joining, by = c("bedrooms", "bathrooms", "sqft_living", "view", "grade", "age",
## "waterfront", "long", "lat", "zipcode", "condition", "sqft_above",
## "sqft_living15", "reno", "price", "sqft_lot", "floors")
train_data2 = rbind( train_data1,influencial_outliers)
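The 4 × mean(cooksd) cut-off used above can be seen in action on a toy regression with one planted high-leverage point:

```r
# Toy regression (hypothetical data): rows 1-9 are roughly y = 2x,
# row 10 is a planted high-leverage outlier at x = 30.
toy <- data.frame(
  x = c(1:9, 30),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.8, 18.2, 5)
)
fit <- lm(y ~ x, data = toy)
cd  <- cooks.distance(fit)
flagged <- which(cd > 4 * mean(cd))  # same rule as in the report
flagged                              # only the planted point is flagged
```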

Modeling on training data :

Training on train data and its accuracy

We now fit the model on the outlier-treated training data, trying a few additional variables compared to last time, and stop when the R-squared value is maximised.

linearmodel_trn = lm( data = train_data2, price~bedrooms+bathrooms+sqft_living+view+grade+age+waterfront+long+lat+zipcode+condition+sqft_above+sqft_living15+reno)
summary(linearmodel_trn)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view + 
##     grade + age + waterfront + long + lat + zipcode + condition + 
##     sqft_above + sqft_living15 + reno, data = train_data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -567961  -75433   -7472   63376 2051290 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.556e+07  1.927e+06  -8.072 7.41e-16 ***
## bedrooms      -1.141e+04  1.369e+03  -8.337  < 2e-16 ***
## bathrooms      3.467e+04  2.183e+03  15.881  < 2e-16 ***
## sqft_living    5.905e+01  3.075e+00  19.203  < 2e-16 ***
## view           3.058e+04  1.616e+03  18.919  < 2e-16 ***
## grade          7.575e+04  1.511e+03  50.124  < 2e-16 ***
## age            1.726e+03  4.922e+01  35.061  < 2e-16 ***
## waterfront     2.006e+05  1.796e+04  11.168  < 2e-16 ***
## long          -6.201e+04  8.710e+03  -7.119 1.14e-12 ***
## lat            5.539e+05  7.157e+03  77.391  < 2e-16 ***
## zipcode       -1.927e+02  2.250e+01  -8.564  < 2e-16 ***
## condition      2.509e+04  1.603e+03  15.654  < 2e-16 ***
## sqft_above     2.230e+01  2.805e+00   7.952 1.96e-15 ***
## sqft_living15  4.418e+01  2.526e+00  17.491  < 2e-16 ***
## reno           2.719e+04  5.123e+03   5.308 1.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119300 on 16356 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6886 
## F-statistic:  2586 on 14 and 16356 DF,  p-value: < 2.2e-16

Model diagnostics on the training predictor :

xyplot(resid(linearmodel_trn)~fitted(linearmodel_trn))

histogram(~residuals(linearmodel_trn),width=200000)

library("mosaic")
qqmath(~resid(linearmodel_trn))
ladd(panel.qqmathline(resid(linearmodel_trn)))

In the residuals-vs-fitted plot above, the residuals are scattered evenly around the zero line, so we can assume the errors have mean zero and roughly constant variance. The histogram and the qq-plot show an approximately normal distribution if we ignore the outliers, which supports the normality assumption required for linear regression. Moreover, we do not have time-series data, so the observations can be treated as independent, and the sample size is large enough for randomization to be considered. Finally, the moderately high R-squared value in the training-model summary suggests that the linearity assumption holds.

Thus, we can conclude that the assumptions of the linear regression model are satisfied in this context.

Accuracy :

y_pred = linearmodel_trn$fitted.values
y_true = train_data2$price

tally_table=data.frame(actual = y_true, predicted = y_pred)
mean_error = mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)

accuracy = 1-mean_error

print(paste("The accuracy of the train data is = ", accuracy, " or = ",round(accuracy*100, digits = 2), "%" )) 
## [1] "The accuracy of the train data is =  0.795543730343887  or =  79.55 %"
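The accuracy metric above is simply 1 minus the mean absolute percentage error (MAPE); a toy illustration on hypothetical actual/predicted vectors:

```r
# MAPE: mean of |actual - predicted| / actual, then accuracy = 1 - MAPE.
actual    <- c(400000, 500000, 250000)
predicted <- c(440000, 450000, 250000)
mape      <- mean(abs(actual - predicted) / actual)
accuracy  <- 1 - mape
round(accuracy, 4)
```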

Result on test data and its accuracy

Removing outliers on test data :

outliers = boxplot(test_data$price,plot=FALSE)$out
outliers_data =test_data[which(test_data$price %in% outliers),]
test_data1 = test_data[-which(test_data$price %in% outliers),]

linearmodel_tst = lm(price ~ bedrooms+bathrooms+sqft_living+view+grade+sqft_lot+age+floors+waterfront, 
                 data = test_data1)

cooksd <- cooks.distance(linearmodel_tst)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
influential_data = test_data[influential, ]

influencial_outliers=inner_join(outliers_data,influential_data)
## Joining, by = c("bedrooms", "bathrooms", "sqft_living", "view", "grade", "age",
## "waterfront", "long", "lat", "zipcode", "condition", "sqft_above",
## "sqft_living15", "reno", "price", "sqft_lot", "floors")
test_data2 = rbind( test_data1,influencial_outliers)

Prediction on test data

linearmodel_tst = lm( data = test_data2, price~bedrooms+bathrooms+sqft_living+view+grade+age+waterfront+long+lat+zipcode+condition+sqft_above+sqft_living15+reno)
summary(linearmodel_tst)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view + 
##     grade + age + waterfront + long + lat + zipcode + condition + 
##     sqft_above + sqft_living15 + reno, data = test_data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -402426  -74274   -7388   62211 1337677 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.588e+06  3.701e+06  -1.780  0.07517 .  
## bedrooms      -9.659e+03  2.227e+03  -4.338 1.48e-05 ***
## bathrooms      2.829e+04  4.344e+03   6.514 8.21e-11 ***
## sqft_living    7.122e+01  5.872e+00  12.128  < 2e-16 ***
## view           4.065e+04  3.072e+03  13.232  < 2e-16 ***
## grade          7.223e+04  2.905e+03  24.868  < 2e-16 ***
## age            1.647e+03  9.739e+01  16.910  < 2e-16 ***
## waterfront     1.051e+05  3.256e+04   3.229  0.00125 ** 
## long          -3.688e+04  1.670e+04  -2.209  0.02724 *  
## lat            5.583e+05  1.377e+04  40.552  < 2e-16 ***
## zipcode       -2.547e+02  4.312e+01  -5.906 3.79e-09 ***
## condition      2.775e+04  3.082e+03   9.001  < 2e-16 ***
## sqft_above     1.462e+01  5.465e+00   2.675  0.00750 ** 
## sqft_living15  3.763e+01  4.835e+00   7.783 8.89e-15 ***
## reno           3.248e+04  1.094e+04   2.968  0.00302 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114900 on 4080 degrees of freedom
## Multiple R-squared:  0.6962, Adjusted R-squared:  0.6952 
## F-statistic: 667.9 on 14 and 4080 DF,  p-value: < 2.2e-16

Model diagnostics on the test predictor :

xyplot(resid(linearmodel_tst)~fitted(linearmodel_tst))

histogram(~residuals(linearmodel_tst),width=200000)

library("mosaic")
qqmath(~resid(linearmodel_tst))
ladd(panel.qqmathline(resid(linearmodel_tst)))

As before, the residuals-vs-fitted plot shows the residuals scattered evenly around the zero line, so we can assume the errors have mean zero and roughly constant variance. The histogram and the qq-plot show an approximately normal distribution if we ignore the outliers, supporting the normality assumption. We have no time-series data, so independence can be assumed, and the sample size is large enough for randomization to be considered. Finally, the moderately high R-squared value suggests that the linearity assumption holds.

Thus, we can conclude that the assumptions of the linear regression model are also satisfied on the test data.

Test Accuracy :

y_pred = linearmodel_tst$fitted.values
y_true = test_data2$price

tally_table=data.frame(actual = y_true, predicted = y_pred)
mean_error = mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)

accuracy = 1-mean_error

print(paste("The accuracy of the test data is = ", accuracy, " or = ",round(accuracy*100, digits = 2), "%" )) 
## [1] "The accuracy of the test data is =  0.798104235768891  or =  79.81 %"

Conclusions :

From the results of the tests carried out, we can conclude that the model produces good accuracy on both the training and the testing data. The selection of independent variables was based on the correlation plot and on the factors that affect housing prices in general. We saw an accuracy of approximately 80% on both the training and testing data, so we can conclude that our model predicts quite well.

However, to get the best possible model, further process improvements are needed, such as checking the model for overfitting or underfitting and trying other modelling approaches (e.g., regularised regression or Bayesian methods).

References :

  1. https://www.kaggle.com/code/ysthehurricane/house-price-prediction-using-r-programming/notebook

  2. https://www.statology.org/

  3. https://rstudio-pubs-static.s3.amazonaws.com/